(not graded): The following are true/false questions. You don’t need to answer the questions. Just tell us which ones you can’t answer confidently in less than one minute. (You won’t be graded on this.) If you can’t answer at least 8, you should probably spend some extra time outside of class beefing up on elementary math. I would strongly suggest going through this math tutorial by Hal Daume: https://web.cs.ucdavis.edu/%7Ehpirsiav/courses/MLf22/math4ml.pdf
(a) $log(x) + log(y) = log(xy) $
Answer: True
(b) $log[ab^c] = log \ a + (log \ b)(log \ c)$
Answer: False; $log[ab^c] = log \ a + c(log \ b)$
(c) $ \frac{\partial}{\partial x} \sigma(x) = \sigma(x) \ \times \ (1-\sigma(x))$ where $\sigma(x) = \frac{1}{1+e^{-x}}$
Answer: True; $ \frac{\partial}{\partial x} \sigma(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) \ \times \ (1-\sigma(x))$
(d) The distance between the point ($x_1, y_1$) and the line $ax+by+c=0$ is $\frac{(ax_1+by_1+c)}{\sqrt{a^{2}+ b^{2}}}$
Answer: True; strictly, the distance is $\frac{|ax_1+by_1+c|}{\sqrt{a^{2}+ b^{2}}}$ (without the absolute value the expression gives the signed distance)
(e) $\frac{\partial}{\partial x} log \ x = - \frac{1}{x}$
Answer: False; $\frac{\partial}{\partial x} log \ x = \frac{1}{x}$
(f) $p(a|b) = \frac{p(a,b)}{p(b)}$
Answer: True; $p(a,b)$ is the joint probability of $a$ and $b$ (the probability of their intersection)
(g) $p(x | y, z) = p(x | y)p(x | z)$
Answer: False; $p(x | y, z) = \frac{p(x,y,z)}{p(y,z)} = \frac{ p(y|x,z)p(x|z)}{p(y|z)}$
(h) $C(n, k) = C(n−1, k −1) +C(n−1, k)$, where $C(n, k)$ is the number of ways of choosing $k$ objects from $n$
Answer: True
(i) $||\alpha u+v||^2 = \alpha ^2 ||u||^2 + ||v||^2$, where $||\cdot||$ denotes the Euclidean norm, $\alpha$ is a scalar, and $u$ and $v$ are vectors
Answer: False in general; expanding gives $||\alpha u+v||^2 = \alpha^2 ||u||^2 + 2\alpha (u \cdot v) + ||v||^2$, so equality holds only when $u$ and $v$ are orthogonal, i.e. $u \cdot v = 0$
(j) $ |u^T v| \ge ||u|| \times ||v||$, where $|·|$ denotes absolute value and $u^⊤v$ is the dot product of $u$ and $v$
Answer: False; $ |u^T v| \le ||u|| \times ||v||$ (Cauchy-Schwarz Inequality)
(k) $ \int_{- \infty}^{\infty} dx \ exp \big[- \frac{\pi}{2} x^2 \big]= \sqrt{2}$
Answer: True
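As a sanity check on the calculus items, the sigmoid-derivative identity in (c) and the Gaussian integral in (k) can be verified numerically (a quick sketch with numpy; a finite grid stands in for the infinite domain):

```python
import numpy as np

# (c) the derivative of the sigmoid equals sigma(x) * (1 - sigma(x))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-5, 5, 1001)
numeric_grad = np.gradient(sigmoid(x), x[1] - x[0])   # finite-difference derivative
analytic_grad = sigmoid(x) * (1 - sigmoid(x))
print(np.max(np.abs(numeric_grad - analytic_grad)))   # tiny: the identity holds

# (k) the integral of exp(-pi/2 * x^2) over the real line equals sqrt(2)
t = np.linspace(-20, 20, 40001)                       # wide grid: the tails are negligible
integral = np.sum(np.exp(-np.pi / 2 * t**2)) * (t[1] - t[0])
print(integral, np.sqrt(2))                           # both approximately 1.41421
```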
(not graded): Go through this Matlab tutorial by Stefan Roth: http://cs.brown.edu/courses/csci1950-g/docs/matlab/matlabtutorialcode.html
In class, we looked at an example where all the attributes were binary (i.e., yes/no valued). Consider an example where instead of the attribute “Morning?”, we had an attribute “Time” which specifies when the class begins.
(a) We can pick a threshold $\tau$ and use (Time < $\tau$) as a criterion to split the data in two. Explain how you might pick the optimal value of $\tau$.
Answer:
If the dataset has a combination of binary and non-binary attributes and a categorical target variable, we need to pick the attribute that gives the "best split". For this we can use Entropy, which measures the degree of randomness or uncertainty in the data and is calculated with the formula below, where $p_i$ is the probability of class $i$: $$E(S) = \sum_{i=1}^{c} -p_i log_2 p_i$$ We can then use a splitting criterion such as Information Gain, which measures the reduction in uncertainty about the target variable and is computed with the formula below, where $y$ denotes the target variable and $x$ represents an additional feature, such as Time: $$IG(x,y) = E(y) - E(y|x)$$ We want the additional feature $x$ to reduce the uncertainty around the target variable, so the attribute that maximizes the information gain (equivalently, minimizes impurity) is used to make the decision. Once the most useful attribute is determined, we create a decision node that splits the data on it and then find the optimal value of $\tau$ for the split.
To find the optimal value of $\tau$ for the attribute Time (assuming it was selected as the best attribute by the process above), we can sort the examples by Time, form a candidate threshold $\tau_i$ at each midpoint between consecutive distinct values (it suffices to consider midpoints where the class label changes), compute the information gain of the split (Time < $\tau_i$) for each candidate, and keep the $\tau_i$ with the highest gain.
This $\tau_i$ is the midpoint that results in the purest child nodes. Trying out every midpoint is a brute-force approach and can be computationally expensive on large datasets; as an alternative, we can discretize Time into a categorical variable (e.g. morning / afternoon / evening) and split on the categories instead.
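A minimal sketch of this threshold search (the Time values and labels are made up for illustration; every midpoint is scored for clarity):

```python
import numpy as np

def entropy(y):
    # E(S) = -sum p_i log2 p_i over the classes present in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(time, y):
    u = np.unique(time)                      # sorted distinct times
    taus = (u[:-1] + u[1:]) / 2              # candidate midpoints
    parent = entropy(y)
    best_tau, best_gain = None, -1.0
    for tau in taus:
        left, right = y[time < tau], y[time >= tau]
        # IG(tau) = E(y) - weighted average of the child entropies
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if parent - child > best_gain:
            best_tau, best_gain = tau, parent - child
    return best_tau, best_gain

time = np.array([8.0, 8.5, 9.0, 13.0, 14.0, 15.5])   # hypothetical start times
y    = np.array([1,   1,   1,   0,    0,    0])      # e.g. a "morning" label
tau, gain = best_threshold(time, y)
print(tau, gain)   # tau = 11.0 yields a pure split (gain = 1.0)
```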
(b) In the decision tree learning algorithm discussed in class, once a binary attribute is used, the subtrees do not need to consider it. Explain why when there are continuous attributes this may not be the case.
Answer:
When an attribute is binary, a single split uses it completely: the two outcomes have been separated, so the subtrees gain nothing from testing it again. A continuous attribute, however, is only partially used by one threshold. After splitting on (Time < $\tau$), a subtree may still reduce impurity by splitting on Time again with a different threshold (e.g. Time < $\tau'$ within the left branch), possibly over several tree levels, until the nodes reach the lowest possible impurity. In other words, one threshold gives only a two-way cut of a continuous range, so the same continuous attribute may be reconsidered further down the tree.
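To make this concrete, here is a tiny sketch (hypothetical class times and labels) where no single threshold on Time separates the classes, but splitting on Time a second time inside a subtree does:

```python
import numpy as np

times  = np.array([7, 9, 11, 13])    # hour at which each class begins
labels = np.array([0, 1, 1, 0])      # positive only for mid-day classes

def split_acc(tau):
    # accuracy of the single split (Time < tau), predicting the
    # majority class on each side
    correct = 0
    for side in (labels[times < tau], labels[times >= tau]):
        if len(side):
            correct += max(np.sum(side == 0), np.sum(side == 1))
    return correct / len(labels)

best_single = max(split_acc(t) for t in (8, 10, 12))
print(best_single)                   # 0.75: one threshold cannot separate them
# but (Time < 8) -> 0, else (Time < 12) -> 1, else 0 classifies all four correctly
two_split = [(0 if t < 8 else (1 if t < 12 else 0)) for t in times]
print(two_split == labels.tolist())  # True
```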
Why is memorizing the training data and doing table lookups a bad strategy for learning? How do we prevent that in decision trees?
Answer:
There are various reasons why memorization is a bad strategy for learning. One reason is that memorizing the training data leads to overfitting: an overfitted model cannot generalize to unseen data. For example, if we fit a polynomial regression model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \epsilon$ to a small dataset with a high enough degree $d$, the model can interpolate the training points, predict their labels exactly, and reach zero training loss. Such a model has very high accuracy on the training data, but its accuracy drops significantly on test data. In the limit, memorizing the training data and doing table lookups only allows predictions on test points that are identical to training points.
In order to prevent memorization or overfitting in decision trees, we can limit the tree's capacity: cap the maximum depth, require a minimum number of training examples per leaf, and prune subtrees whose splits do not improve performance on held-out data.
After applying these methods, the model should be evaluated on a validation set to make sure they had a positive effect, i.e. higher accuracy, lower error, etc.
What does the decision boundary of 1-nearest neighbor classifier for 2 points (one positive, one negative) look like?
Answer:
The decision boundary of a 1-nearest-neighbor classifier for 2 points that belong to different classes (one positive, one negative) is the perpendicular bisector of the segment joining them: the line equidistant from the two points. (With more points the boundary becomes piecewise linear, but with only two points it is a single straight line.)
import plotly.express as px
import plotly.graph_objects as go
# two points at (-5, 5) and (5, 5); the boundary is the vertical line x = 0
fig = px.scatter(x=[-5, 5], y=[5, 5])
fig.add_trace(go.Scatter(x=[0, 0], y=[0, 10], mode='lines', name='decision boundary', line=dict(color='red')))
fig.show(renderer='notebook')
Does the accuracy of a kNN classifier using the Euclidean distance change if you (a) translate the data, (b) scale the data (i.e., multiply all the points by a constant), or (c) rotate the data? Explain. Answer the same for a kNN classifier using the Manhattan distance.
Answer:
i. kNN using Euclidean distance
(a) translate the data
Answer:
No, the accuracy of a kNN classifier using the Euclidean distance does not change if we translate the data. Translating the data shifts every point by the same fixed amount, which leaves all pairwise distances between data points unchanged, so each test point keeps exactly the same nearest neighbors after translation.
For example, if we have two data points A and B with coordinates (2, 3) and (4, 5), respectively. The distance between these two points can be calculated using the Euclidean distance formula as follows:
$$distance \ \ = \ \ \sqrt{(4 - 2)^2 + (5 - 3)^2} = \sqrt{8}$$
If we translate both points by a fixed amount $(dx, dy)$, their new coordinates become $(2+dx, 3+dy)$ and $(4+dx, 5+dy)$, respectively. The distance between the two points stays the same:
$$distance \ \ = \ \ \sqrt{(4+dx - 2-dx)^2 + (5+dy - 3-dy)^2} = \sqrt{8}$$
(b) scale the data (i.e., multiply all the points by a constant)
Answer:
No, the accuracy of a kNN classifier using the Euclidean distance does not change if we scale all points by a constant $c$. Scaling multiplies every pairwise distance by $|c|$: the distances themselves change, but their relative ordering does not, so each test point's k nearest neighbors are identical before and after scaling.
For example, consider two data points A and B with coordinates (2, 3) and (4, 5), respectively. The distance between these two points can be calculated using the Euclidean distance formula as follows:
$$distance \ = \ \sqrt{(4 - 2)^2 + (5 - 3)^2} = \sqrt{8}$$
Now suppose we scale the data points by a factor of 2, so that the new coordinates become (4, 6) and (8, 10), respectively. The distance between the two points doubles:
$$distance \ = \ \sqrt{(8 - 4)^2 + (10 - 6)^2} = \sqrt{32} = \sqrt{2^2 \cdot 8} = 2\sqrt{8}$$
Every pairwise distance is multiplied by the same factor of 2, so the relative distances between the data points remain the same and the accuracy of the kNN classifier is unaffected. Note that this relies on all coordinates being scaled by the same constant; scaling different features by different amounts can change neighbor orderings and therefore accuracy.
(c) rotate the data
Answer:
No, rotation does not change the accuracy of kNN using the Euclidean distance. A rotation is an orthogonal transformation: it preserves lengths, shapes, and angles, so every pairwise Euclidean distance is exactly unchanged, and each test point keeps the same nearest neighbors regardless of the data or the angle of rotation. The calculations below illustrate this for several rotations; as shown in part (ii.c), the Manhattan distance is not rotation-invariant in the same way.
If we consider 3 points $x = (0, 1), y = (1, 0)$, and $z = (1, -2)$, then the Euclidean distances between the 3 points, using $y$ as the pivot point, would be
$$||x-y||_2 = \sqrt{(0-1)^2 + (1-0)^2}= \sqrt{2}$$$$||y-z||_2 = \sqrt{(1-1)^2 +(0-(-2))^2} = \sqrt{4}=2$$$$||x-y||_2 < ||y-z||_2$$which means that $x$ and $z$ are not the same distance from $y$.
If we consider a 45-degree rotation matrix $ A = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} $, then the new coordinates after the 45-degree rotation are given by $ A \begin{bmatrix} x \\ y\\ \end{bmatrix} $
Hence, x, y and z get transformed to $x^*$, $y^*$ and $z^*$ as follows,
$$x^* = Ax = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}}\\ \end{bmatrix}$$$$y^* = Ay = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}$$$$z^* = Az = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} \frac{3}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}}\\ \end{bmatrix}$$Thus, the new Euclidean distance after a 45-degree rotation becomes $$||x^*-y^*||_2 = \sqrt{\Bigg(\frac{-1}{\sqrt{2}} - \frac{1}{\sqrt{2}}\Bigg)^2+\Bigg(\frac{1}{\sqrt{2}} - \frac{1}{\sqrt{2}}\Bigg)^2} =\sqrt{2}$$
$$||y^*-z^*||_2 =\sqrt{\Bigg(\frac{1}{\sqrt{2}} - \frac{3}{\sqrt{2}}\Bigg)^2+\Bigg(\frac{1}{\sqrt{2}} - \frac{-1}{\sqrt{2}} \Bigg)^2} =\sqrt{4}=2$$Hence, ordering is preserved with Euclidean distance after 45-degree rotation since we get $$||x^*-y^*||_2< ||y^*-z^ *||_2 $$
If we instead consider a 90-degree rotation matrix $ B = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} $, the points x, y and z get transformed to $x^{**}$, $y^{**}$ and $z^{**}$ as follows,
$$x^{**} = Bx = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} -1 \\ 0\\ \end{bmatrix}$$$$y^{**} = By = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} 0 \\ 1\\ \end{bmatrix}$$$$z^{**} = Bz = \begin{bmatrix} 0 & -1\\ 1 & 0\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} 2 \\ 1\\ \end{bmatrix}$$Thus, the new Euclidean distance after a 90-degree rotation becomes $$||x^{**}-y^{**}||_2 = \sqrt{(-1 - 0)^2+(0 - 1)^2} =\sqrt{2}$$
$$||y^{**}-z^{**}||_2 =\sqrt{(0 - 2)^2+(1 - 1)^2} =\sqrt{4}=2$$Hence, ordering is still preserved with Euclidean distance after 90-degree rotation since we get $$||x^{**}-y^{**}||_2< ||y^{**}-z^{**}||_2 $$
Finally, under a 180-degree rotation $C = \begin{bmatrix} -1 & 0\\ 0 & -1\\ \end{bmatrix}$, the points become $x^{***} = (0, -1)$, $y^{***} = (-1, 0)$ and $z^{***} = (-1, 2)$, so the new Euclidean distance after the 180-degree rotation becomes $$||x^{***}-y^{***}||_2 = \sqrt{(0 - (-1))^2+(-1 - 0)^2} =\sqrt{2}$$
$$||y^{***}-z^{***}||_2 =\sqrt{(-1 - (-1))^2+(0 - 2)^2} =\sqrt{4}=2$$Hence, ordering is still preserved with Euclidean distance after 180-degree rotation since we get $$||x^{***}-y^{***}||_2< ||y^{***}-z^{***}||_2 $$
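These three cases can also be checked numerically in one place (a sketch with the same points $x$, $y$, $z$; `rot` builds a generic 2-D rotation matrix and the shift vector is arbitrary):

```python
import numpy as np

x, y, z = np.array([0., 1.]), np.array([1., 0.]), np.array([1., -2.])
d = lambda a, b: np.linalg.norm(a - b)                    # Euclidean distance

def rot(theta):
    # 2-D rotation matrix for angle theta
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

print(d(x, y), d(y, z))                                   # sqrt(2) and 2
shift = np.array([3., -7.])                               # arbitrary translation
print(d(x + shift, y + shift), d(y + shift, z + shift))   # unchanged by translation
print(d(2 * x, 2 * y), d(2 * y, 2 * z))                   # both doubled: ordering kept
for theta in (np.pi / 4, np.pi / 2, np.pi):               # 45, 90, 180 degrees
    R = rot(theta)
    print(d(R @ x, R @ y), d(R @ y, R @ z))               # unchanged by any rotation
```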
ii. kNN using Manhattan distance
(a) translate the data
Answer:
Similar to part (i.a), translating the data does not change the accuracy of kNN using the Manhattan distance. Translation adds the same constant to the corresponding coordinate of every point, and that constant cancels inside each absolute difference: $|(x_i + c) - (y_i + c)| = |x_i - y_i|$, so every pairwise Manhattan distance is unchanged, whether we shift one feature or all of them.
For example, consider the three 2D points $x=(0,1), y=(1,2)$, and $z=(2,1)$. Translating feature X by a constant of 1 gives the transformed versions $x_1=(1,1), y_1=(2,2)$, and $z_1=(3,1)$. The Manhattan distances of the original and translated points, using $y$ and $y_1$ as the pivot points, would be
$$||x-y||_1 = |0-1|+ |1-2|= 2$$$$||y-z||_1 = |1-2|+|2-1| = 2$$$$||x_1-y_1||_1 = |1-2| + |1-2|= 2$$$$||y_1-z_1||_1 = |2-3| +|2-1| = 2$$
If we instead add a constant of 1 to both features X and Y, we get the translated points $x_2=(1,2), y_2=(2,3)$, and $z_2=(3,2)$, and the Manhattan distances between these 3 points are again
$$||x_2-y_2||_1 = |1-2| + |2-3|= 2$$$$||y_2-z_2||_1 = |2-3| +|3-2| = 2$$
All pairwise Manhattan distances are preserved, so the nearest neighbors, and hence the accuracy, do not change under translation.
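A quick numerical check of the coordinate-wise cancellation, using the same three points and a few translations (the shift vectors are arbitrary):

```python
import numpy as np

x, y, z = np.array([0., 1.]), np.array([1., 2.]), np.array([2., 1.])
l1 = lambda a, b: np.sum(np.abs(a - b))      # Manhattan distance

print(l1(x, y), l1(y, z))                    # 2.0 2.0 before translation
for shift in (np.array([1., 0.]), np.array([1., 1.]), np.array([-4., 9.])):
    # |(a_i + c_i) - (b_i + c_i)| = |a_i - b_i| in every coordinate
    print(l1(x + shift, y + shift), l1(y + shift, z + shift))   # 2.0 2.0 each time
```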
(b) scale the data (i.e., multiply the all the points by a constant)
Answer:
The accuracy of a kNN classifier using the Manhattan distance does not change if we scale all points by a constant $c$, because every pairwise Manhattan distance, the sum of absolute coordinate differences, is simply multiplied by $|c|$. The distances scale uniformly, so the relative ordering of distances, and therefore each point's nearest neighbors, stays the same.
For example, consider two data points A and B with coordinates (2, 3) and (4, 5), respectively. The Manhattan distance between these two points can be calculated as follows:
$$distance \ = \ |4 - 2| + |5 - 3| = 4$$
Now suppose we scale the data points by a factor of 2, so that the new coordinates become (4, 6) and (8, 10), respectively. The Manhattan distance between the two points doubles, as does every other pairwise distance, so the neighbor ordering is intact:
$$distance \ = \ |8 - 4| + |10 - 6| = 2 \cdot 4 = 8$$
(c) rotate the data
Answer:
Unlike the Euclidean case in part (i.c), the accuracy of a kNN classifier using the Manhattan distance can change when the data is rotated: the Manhattan distance is not rotation-invariant, and whether the accuracy actually changes depends on the angle of rotation and on the data itself.
If we consider 3 points $x = (0, 1), y = (1, 0)$, and $z = (1, -2)$, then the Manhattan distance, using $y$ as the pivot point, between the 3 points would be
$$||x-y||_1 = |0-1| + |1-0| = 2$$$$||y-z||_1 = |1-1| + |0-(-2)| = 2$$which means that $x$ and $z$ are the same distance from $y$. If we consider a 45-degree rotation matrix $ A = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \quad $
then the new coordinates after the 45-degree rotation are given by $ A \begin{bmatrix} x \\ y\\ \end{bmatrix} $
Hence, x, y and z get transformed to $x^*$, $y^*$ and $z^*$ as follows,
$$x^* = Ax = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 0 \\ 1\\ \end{bmatrix} = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}}\\ \end{bmatrix}$$$$y^* = Ay = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ 0\\ \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}$$$$z^* = Az = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \end{bmatrix} \begin{bmatrix} 1 \\ -2\\ \end{bmatrix} = \begin{bmatrix} \frac{3}{\sqrt{2}} \\ \frac{-1}{\sqrt{2}}\\ \end{bmatrix}$$Thus, the new Manhattan distance after a 45-degree rotation becomes $$||x^*-y^*||_1 = \Bigg|\frac{-1}{\sqrt{2}} - \frac{1}{\sqrt{2}}\Bigg|+\Bigg|\frac{1}{\sqrt{2}} - \frac{1}{\sqrt{2}} \Bigg| = \sqrt{2}$$
$$||y^*-z^*||_1 =\Bigg|\frac{1}{\sqrt{2}} - \frac{3}{\sqrt{2}}\Bigg|+\Bigg|\frac{1}{\sqrt{2}} - \frac{-1}{\sqrt{2}} \Bigg| =\frac{4}{\sqrt{2}}$$Hence, Manhattan distances are not preserved after the 45-degree rotation: the two points that were equidistant from $y$ no longer are, since $$||x^*-y^*||_1 < ||y^*-z^*||_1 $$ and such changes in relative distance can alter a kNN classifier's predictions.
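The rotation calculations above can be verified numerically; note that a 90-degree rotation merely permutes and negates coordinates, so it happens to preserve Manhattan distances, while the 45-degree rotation breaks the tie:

```python
import numpy as np

x, y, z = np.array([0., 1.]), np.array([1., 0.]), np.array([1., -2.])
l1 = lambda a, b: np.sum(np.abs(a - b))      # Manhattan distance

def rot(theta):
    # 2-D rotation matrix for angle theta
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

print(l1(x, y), l1(y, z))                    # 2.0 2.0 : x and z equidistant from y
A = rot(np.pi / 4)                           # 45 degrees
print(l1(A @ x, A @ y), l1(A @ y, A @ z))    # sqrt(2) vs 4/sqrt(2): the tie is broken
B = rot(np.pi / 2)                           # 90 degrees: an axis permutation
print(l1(B @ x, B @ y), l1(B @ y, B @ z))    # 2.0 2.0 : distances preserved
```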
Implement kNN in Matlab or Python for handwritten digit classification and submit all codes and plots:
(a) Download MNIST digit dataset (60,000 training and 10,000 testing data points) and the starter code from the course page. Each row in the matrix represents a handwritten digit image. The starter code shows how to visualize an example data point in Matlab. The task is to predict the class (0 to 9) for a given test image, so it is a 10-way classification problem.
# Import dependencies
import pandas as pd
import nbconvert
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import operator
from operator import itemgetter
import plotly.express as px
import timeit
#Loading the data
M = loadmat('data/MNIST_digit_data.mat')
images_train,images_test,labels_train,labels_test= M['images_train'],M['images_test'],M['labels_train'],M['labels_test']
#just to make all random sequences on all computers the same.
np.random.seed(1)
#randomly permute data points
inds = np.random.permutation(images_train.shape[0])
images_train = images_train[inds]
labels_train = labels_train[inds]
inds = np.random.permutation(images_test.shape[0])
images_test = images_test[inds]
labels_test = labels_test[inds]
#if you want to use only the first 1000 data points.
images_train = images_train[0:1000,:]
labels_train = labels_train[0:1000,:]
#show the 10'th train image
i=10
im = images_train[i,:].reshape((28,28),order='F')
plt.imshow(im)
plt.title('Class Label:'+str(labels_train[i][0]))
plt.show()
print(images_train.shape, labels_train.shape)
print(images_test.shape, labels_test.shape)
(1000, 784) (1000, 1) (10000, 784) (10000, 1)
(b) Write a Matlab or Python function that implements kNN for this task and reports the accuracy for each class (10 numbers) as well as the average accuracy (one number).
[acc, acc_av] = kNN(images_train, labels_train, images_test, labels_test, k)
where acc is a vector of length 10 and acc_av is a scalar. Look at a few correct and wrong predictions to see if it makes sense. To speed it up, in all experiments, you may use only the first 1000 testing images.
# Class ‘KNN’ with initialized instances
class KNN:
    def __init__(self, k):
        self.k = k

    def fit_kNN(self, images_train, labels_train):
        self.images_train = images_train
        self.labels_train = labels_train

    def l2_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2)**2))

    def predict_kNN(self, images_test):
        pred_knn = []  # a list to store predictions
        #iterate through test data
        for i in range(len(images_test)):
            #calculate distances between test and training data
            d = np.array([self.l2_distance(images_test[i], x_t) for x_t in self.images_train])
            #sort distances and only keep k neighbors
            sorted_d = d.argsort()[:self.k]
            #initialize dict to count class occurrences in labels_train
            cnt_neighbors = {}
            #iterate over neighbors
            for j in sorted_d:
                if self.labels_train[j][0] in cnt_neighbors:
                    cnt_neighbors[self.labels_train[j][0]] += 1
                else:
                    cnt_neighbors[self.labels_train[j][0]] = 1
            #sort occurrence values from most to least
            sorted_neigh_cnt = sorted(cnt_neighbors.items(), key=operator.itemgetter(1), reverse=True)
            pred_knn.append(sorted_neigh_cnt[0][0])
        return pred_knn

    def score_kNN(self, images_test, labels_test):
        pred = self.predict_kNN(images_test)
        #convert labels_test from a nested list to a flat array
        flat_labels_test = np.array([item for elem in labels_test for item in elem])
        return (pred == flat_labels_test).sum() / len(labels_test)

    def class_acc(self, images_test, labels_test):
        pred = self.predict_kNN(images_test)
        #convert labels_test from a nested list to a flat array
        flat_labels_test = np.array([item for elem in labels_test for item in elem])
        cm = pd.crosstab(flat_labels_test, pred)
        #per-class accuracy from the confusion matrix; the row maximum assumes the
        #most frequent prediction for each class is the correct one (the strict
        #version would use the diagonal of cm)
        class_cm = np.array(cm.max(axis=1) / cm.sum(axis=1))
        acc_av = sum(class_cm) / len(class_cm)  #yields acc similar to score_kNN's result
        acc = [[class_cm.tolist()], acc_av]
        return acc  #classwise accuracy = number of correct predictions / total predictions for a class
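As an aside, the Python-level loop over training images dominates the runtime of `predict_kNN`. The pairwise squared distances can instead be computed in one shot with the expansion $||a-b||^2 = ||a||^2 - 2\,a \cdot b + ||b||^2$; the sketch below (function name illustrative) is equivalent up to tie-breaking in the vote:

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_test, k):
    # squared L2 distances between every test and training point, via
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2  (no Python loop over pairs)
    d2 = (np.sum(X_test**2, axis=1)[:, None]
          - 2 * X_test @ X_train.T
          + np.sum(X_train**2, axis=1)[None, :])
    # indices of the k nearest training points for each test point
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train.ravel()[nearest]                 # (n_test, k) neighbor labels
    # majority vote per test point
    return np.array([np.bincount(row).argmax() for row in votes])
```

On the (1000, 784) training matrix used above, this replaces millions of per-pair Python calls with a single matrix product.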
%%time
#Test class
model = KNN(k=3)
model.fit_kNN(images_train, labels_train)
pred= model.predict_kNN(images_test)
#model.score_kNN(images_test, labels_test)
#return accuracy for each class (10 numbers) + average accuracy
acc= model.class_acc(images_test, labels_test)
acc
CPU times: total: 4min 55s Wall time: 5min 7s
[[[0.9683673469387755, 0.9947136563876652, 0.8517441860465116, 0.8702970297029703, 0.8462321792260692, 0.8295964125560538, 0.9613778705636743, 0.8949416342412452, 0.8049281314168378, 0.8909811694747275]], 0.8913179616554532]
#average accuracy
model.score_kNN(images_test, labels_test)
0.8931
def l2_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

# Function that implements kNN to predict the class (0 to 9) for a given test
# image (a 10-way classification problem) and reports the accuracy for each
# class (10 numbers) as well as the average accuracy (one number).
def KNN(images_train, labels_train, images_test, labels_test, k):
    pred_knn = []  # a list to store predictions
    #iterate through test data
    for i in range(len(images_test)):
        #calculate distances between test and training data
        d = np.array([l2_dist(images_test[i], x_t) for x_t in images_train])
        #sort distances and only keep k neighbors
        sorted_d = d.argsort()[:k]
        #initialize dict to count class occurrences in labels_train
        cnt_neighbors = {}
        #iterate over neighbors
        for j in sorted_d:
            if labels_train[j][0] in cnt_neighbors:
                cnt_neighbors[labels_train[j][0]] += 1
            else:
                cnt_neighbors[labels_train[j][0]] = 1
        #sort occurrence values from most to least
        sorted_neigh_cnt = sorted(cnt_neighbors.items(), key=operator.itemgetter(1), reverse=True)
        pred_knn.append(sorted_neigh_cnt[0][0])
    ##classwise accuracy and average accuracy
    #convert labels_test from a nested list to a flat list
    flat_labels_test = [item for elem in labels_test for item in elem]
    cm = pd.crosstab(flat_labels_test, pred_knn)
    #per-class accuracy from the confusion matrix; the row maximum assumes the
    #most frequent prediction for each class is the correct one
    class_acc = np.array(cm.max(axis=1) / cm.sum(axis=1))
    acc_av = sum(class_acc) / len(class_acc)  #average of the classwise accuracies
    acc = [class_acc.tolist(), acc_av]
    return acc  #classwise accuracy = number of correct predictions / total predictions for a class
%%time
#Test function
KNN(images_train, labels_train, images_test, labels_test, 3)
CPU times: total: 2min 26s Wall time: 2min 33s
[[0.9683673469387755, 0.9947136563876652, 0.8517441860465116, 0.8702970297029703, 0.8462321792260692, 0.8295964125560538, 0.9613778705636743, 0.8949416342412452, 0.8049281314168378, 0.8909811694747275], 0.8913179616554532]
(c) For k = 1, change the number of training data points (30 to 10,000) to see the change in performance. Plot the average accuracy for 10 different dataset sizes. You may use command logspace in Matlab. In the plot, x-axis is for the number of training data and y-axis is for the accuracy.
img_train,img_test,lbl_train,lbl_test= M['images_train'],M['images_test'],M['labels_train'],M['labels_test']
# Change the number of training data points (30 to 10,000)
#set 10 different sizes
size = np.logspace(np.log10(30) , np.log10(10000) , num=10)
size = np.around(size)
size = size.astype(int)
size
#initialize empty dictionaries for the different samples
img_dict = {}
lbl_dict = {}
#loop to get the different samples into the dictionaries
for s in range(len(size)):
    inds = np.random.permutation(img_train.shape[0])  #get random indices for each size
    X = img_train[inds]
    y = lbl_train[inds]
    img_dict["img_train{0}".format(size[s])] = X[0:size[s], :]
    lbl_dict["lbl_train{0}".format(size[s])] = y[0:size[s], :]
#print the shape of each image sample
for key in img_dict:
    print(img_dict[key].shape)
(30, 784) (57, 784) (109, 784) (208, 784) (397, 784) (756, 784) (1442, 784) (2750, 784) (5244, 784) (10000, 784)
#print the shape of each label sample
for key in lbl_dict:
    print(lbl_dict[key].shape)
(30, 1) (57, 1) (109, 1) (208, 1) (397, 1) (756, 1) (1442, 1) (2750, 1) (5244, 1) (10000, 1)
%%time
# Fit k=1 for different training sizes
acc_av_dict = {}  # initialize an empty dictionary for average accuracies
for s in range(len(size)):
    avg_acc = KNN(img_dict["img_train{}".format(size[s])], lbl_dict["lbl_train{}".format(size[s])], images_test, labels_test, 1)
    acc_av_dict[size[s]] = avg_acc[1]
    print("size {} done".format(size[s]))
size 30 done size 57 done size 109 done size 208 done size 397 done size 756 done size 1442 done size 2750 done size 5244 done size 10000 done CPU times: total: 52min 19s Wall time: 55min 8s
#convert dict to df
avg_accdf = pd.DataFrame.from_dict(acc_av_dict.items())
avg_accdf.columns = ['Number of training data', 'Accuracy']
avg_accdf
| | Number of training data | Accuracy |
|---|---|---|
| 0 | 30 | 0.561468 |
| 1 | 57 | 0.615278 |
| 2 | 109 | 0.716249 |
| 3 | 208 | 0.802582 |
| 4 | 397 | 0.839968 |
| 5 | 756 | 0.871002 |
| 6 | 1442 | 0.896835 |
| 7 | 2750 | 0.921617 |
| 8 | 5244 | 0.932798 |
| 9 | 10000 | 0.949314 |
# Plot the average accuracy for 10 different dataset sizes
# List arguments
fig = px.line(x=avg_accdf['Number of training data'], y=avg_accdf['Accuracy'],title="Accuracy of kNN using k=1 for different training sample sizes")
fig.update_layout(template = 'plotly_dark',
xaxis_title="training sample size",
yaxis_title="kNN accuracy",
)
fig.show(renderer='notebook')
Observation(s):
The average accuracy grows roughly logarithmically with the training-set size: each increase in sample size yields a smaller gain, and the curve appears to level off near 0.95 rather than continuing steeply toward 1.
(d) Show the effect of k on the accuracy. Make a plot similar to the above one with multiple colored curves on the top of each other (each for a particular k in [1 2 3 5 10].) You may use command legend in Matlab to name different colors.
%%time
k_vals = [1, 2, 3, 5, 10]  #set different values of k
acc_av_dict_per_k = {}
for k in range(len(k_vals)):
    for s in range(len(size)):
        avg_acc = KNN(img_dict["img_train{}".format(size[s])], lbl_dict["lbl_train{}".format(size[s])], images_test, labels_test, k_vals[k])
        acc_av_dict_per_k['{}_{}'.format(k_vals[k], size[s])] = avg_acc[1]
        print('{}_{} done'.format(k_vals[k], size[s]))
1_30 done 1_57 done 1_109 done 1_208 done 1_397 done 1_756 done 1_1442 done 1_2750 done 1_5244 done 1_10000 done 2_30 done 2_57 done 2_109 done 2_208 done 2_397 done 2_756 done 2_1442 done 2_2750 done 2_5244 done 2_10000 done 3_30 done 3_57 done 3_109 done 3_208 done 3_397 done 3_756 done 3_1442 done 3_2750 done 3_5244 done 3_10000 done 5_30 done 5_57 done 5_109 done 5_208 done 5_397 done 5_756 done 5_1442 done 5_2750 done 5_5244 done 5_10000 done 10_30 done 10_57 done 10_109 done 10_208 done 10_397 done 10_756 done 10_1442 done 10_2750 done 10_5244 done 10_10000 done CPU times: total: 4h 16min 49s Wall time: 6h 20min 40s
#convert dict to df
avg_acc_kdf = pd.DataFrame.from_dict(acc_av_dict_per_k.items())
avg_acc_kdf.columns = ['ID','acc']
avg_acc_kdf['ID'] = avg_acc_kdf['ID'].astype('string')
avg_acc_kdf[['k','size']] = avg_acc_kdf.ID.str.split('_', expand=True)
avg_acc_kdf.dtypes
avg_acc_kdf
| | ID | acc | k | size |
|---|---|---|---|---|
| 0 | 1_30 | 0.561468 | 1 | 30 |
| 1 | 1_57 | 0.615278 | 1 | 57 |
| 2 | 1_109 | 0.716249 | 1 | 109 |
| 3 | 1_208 | 0.802582 | 1 | 208 |
| 4 | 1_397 | 0.839968 | 1 | 397 |
| 5 | 1_756 | 0.871002 | 1 | 756 |
| 6 | 1_1442 | 0.896835 | 1 | 1442 |
| 7 | 1_2750 | 0.921617 | 1 | 2750 |
| 8 | 1_5244 | 0.932798 | 1 | 5244 |
| 9 | 1_10000 | 0.949314 | 1 | 10000 |
| 10 | 2_30 | 0.561468 | 2 | 30 |
| 11 | 2_57 | 0.615278 | 2 | 57 |
| 12 | 2_109 | 0.716249 | 2 | 109 |
| 13 | 2_208 | 0.802582 | 2 | 208 |
| 14 | 2_397 | 0.839968 | 2 | 397 |
| 15 | 2_756 | 0.871002 | 2 | 756 |
| 16 | 2_1442 | 0.896835 | 2 | 1442 |
| 17 | 2_2750 | 0.921617 | 2 | 2750 |
| 18 | 2_5244 | 0.932798 | 2 | 5244 |
| 19 | 2_10000 | 0.949314 | 2 | 10000 |
| 20 | 3_30 | 0.551251 | 3 | 30 |
| 21 | 3_57 | 0.615399 | 3 | 57 |
| 22 | 3_109 | 0.725491 | 3 | 109 |
| 23 | 3_208 | 0.791067 | 3 | 208 |
| 24 | 3_397 | 0.833035 | 3 | 397 |
| 25 | 3_756 | 0.878071 | 3 | 756 |
| 26 | 3_1442 | 0.901486 | 3 | 1442 |
| 27 | 3_2750 | 0.924864 | 3 | 2750 |
| 28 | 3_5244 | 0.939595 | 3 | 5244 |
| 29 | 3_10000 | 0.951342 | 3 | 10000 |
| 30 | 5_30 | 0.589869 | 5 | 30 |
| 31 | 5_57 | 0.606772 | 5 | 57 |
| 32 | 5_109 | 0.720239 | 5 | 109 |
| 33 | 5_208 | 0.777778 | 5 | 208 |
| 34 | 5_397 | 0.823589 | 5 | 397 |
| 35 | 5_756 | 0.876382 | 5 | 756 |
| 36 | 5_1442 | 0.898421 | 5 | 1442 |
| 37 | 5_2750 | 0.925827 | 5 | 2750 |
| 38 | 5_5244 | 0.938605 | 5 | 5244 |
| 39 | 5_10000 | 0.950336 | 5 | 10000 |
| 40 | 10_30 | 0.639373 | 10 | 30 |
| 41 | 10_57 | 0.602814 | 10 | 57 |
| 42 | 10_109 | 0.665238 | 10 | 109 |
| 43 | 10_208 | 0.745905 | 10 | 208 |
| 44 | 10_397 | 0.794459 | 10 | 397 |
| 45 | 10_756 | 0.858533 | 10 | 756 |
| 46 | 10_1442 | 0.886947 | 10 | 1442 |
| 47 | 10_2750 | 0.917808 | 10 | 2750 |
| 48 | 10_5244 | 0.932122 | 10 | 5244 |
| 49 | 10_10000 | 0.946892 | 10 | 10000 |
#Plot
fig = px.line(x=avg_acc_kdf['size'], y=avg_acc_kdf['acc'],
color = avg_acc_kdf['k'],
title="Accuracy of kNN for different k and training sample sizes")
fig.update_layout(template = 'plotly_dark',
xaxis_title="training sample size",
yaxis_title="kNN accuracy",
legend_title_text='k'
)
fig.show(renderer='notebook')
Observation(s):
At sample size 30, k=10 happens to reach the highest accuracy, but it has the lowest accuracy as the sample size grows, while k=3 and k=5 achieve the highest accuracies at the largest sample sizes. There is also an overall dip in accuracy across the different values of k around sample size 57. Note that the k=1 and k=2 curves coincide exactly: with k=2, a 1-1 tie in the majority vote is broken in favor of the label counted first, which belongs to the nearer neighbor, so the decision reduces to k=1.
(e) Choose the best k for 2,000 total training data by splitting the training data into two halves (the first for training and the second for validation). You may plot the average accuracy wrt k for this. Note that in this part, you should not use the test data. You may search for k in this list: [1 2 3 5 10].
# For the last part only, please choose 2000 training data randomly and use those only
#split training data into training and validation
#choose any 2k images out of 60k and split them in half
inds = np.random.permutation(images_train.shape[0])
images_train = images_train[inds]
labels_train = labels_train[inds]
images_train = images_train[0:2000,:]
labels_train = labels_train[0:2000,:]
#training images and labels
i =np.random.permutation(images_train.shape[0])
images_train = images_train[i]
labels_train = labels_train[i]
img_tr = images_train[0:1000,:]
lbl_tr = labels_train[0:1000,:]
i =np.random.permutation(images_train.shape[0])
images_train = images_train[i]
labels_train = labels_train[i]
img_vl = images_train[0:1000,:]
lbl_vl = labels_train[0:1000,:]
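Note that `images_train` was already truncated to its first 1,000 rows back in part (a), and both "halves" here are produced by re-permuting that same array, so the training and validation halves actually cover the same 1,000 points; that overlap is what drives the perfect k=1 and k=2 validation accuracies below. A helper along these lines (name illustrative) would give genuinely disjoint halves, assuming the full training arrays are first reloaded from `M`:

```python
import numpy as np

def disjoint_halves(images, labels, n=2000, seed=1):
    # sample n distinct rows once, then slice: the halves cannot overlap
    rng = np.random.default_rng(seed)
    inds = rng.permutation(images.shape[0])[:n]
    imgs, lbls = images[inds], labels[inds]
    half = n // 2
    return (imgs[:half], lbls[:half]), (imgs[half:], lbls[half:])
```

Usage would be `(img_tr, lbl_tr), (img_vl, lbl_vl) = disjoint_halves(M['images_train'], M['labels_train'])`.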
%%time
#use training data to find nearest neighbors
#use validation data to find the best k
cv_dict = {}
folds = 2
img_training = img_tr
lbl_training = lbl_tr
img_validation = img_vl
lbl_validation = lbl_vl
for f in range(folds):
    print('fold {} running'.format(f))
    for k in range(len(k_vals)):
        acc = KNN(img_training, lbl_training, img_validation, lbl_validation, k_vals[k])
        cv_dict['{}_{}'.format(f, k_vals[k])] = acc[1]
        print('k = {} done'.format(k_vals[k]))
    #swap the two halves so the second fold trains on the former validation set
    img_training = img_vl
    lbl_training = lbl_vl
    img_validation = img_tr
    lbl_validation = lbl_tr
fold 0 running k = 1 done k = 2 done k = 3 done k = 5 done k = 10 done fold 1 running k = 1 done k = 2 done k = 3 done k = 5 done k = 10 done CPU times: total: 2min 5s Wall time: 2min 8s
#convert dict to df
cv_accdf = pd.DataFrame.from_dict(cv_dict.items())
cv_accdf.columns = ['id', 'Accuracy']
cv_accdf['id'] = cv_accdf['id'].astype('string')
cv_accdf[['fold','k']] = cv_accdf.id.str.split('_', expand=True)
cv_accdf
| | id | Accuracy | fold | k |
|---|---|---|---|---|
| 0 | 0_1 | 1.000000 | 0 | 1 |
| 1 | 0_2 | 1.000000 | 0 | 2 |
| 2 | 0_3 | 0.970020 | 0 | 3 |
| 3 | 0_5 | 0.929642 | 0 | 5 |
| 4 | 0_10 | 0.895525 | 0 | 10 |
| 5 | 1_1 | 1.000000 | 1 | 1 |
| 6 | 1_2 | 1.000000 | 1 | 2 |
| 7 | 1_3 | 0.970020 | 1 | 3 |
| 8 | 1_5 | 0.929642 | 1 | 5 |
| 9 | 1_10 | 0.895525 | 1 | 10 |
#average accuracies per k
cv_avgaccdf = cv_accdf.groupby('k', as_index=False)['Accuracy'].mean()
cv_avgaccdf.sort_values(by=['Accuracy'], ascending=False, inplace=True)
cv_avgaccdf
| | k | Accuracy |
|---|---|---|
| 0 | 1 | 1.000000 |
| 2 | 2 | 1.000000 |
| 3 | 3 | 0.970020 |
| 4 | 5 | 0.929642 |
| 1 | 10 | 0.895525 |
#Plot
fig = px.line(x=cv_avgaccdf['k'], y=cv_avgaccdf['Accuracy'],
title="Accuracy of 2-fold cross-validated kNN for different k values")
fig.update_layout(template = 'plotly_dark',
xaxis_title="k",
yaxis_title="kNN accuracy",
)
fig.show(renderer='notebook')
Observation(s):
We can see that at low values of k (k=1, 2) the kNN classifier follows its training data very closely and reaches a perfect validation score here; this reflects overfitting (amplified by the fact that the two halves contain the same points, so every validation example has an exact copy in the training set). Choosing the smallest k beyond this degenerate region, the optimal k from the searched list [1, 2, 3, 5, 10] would be k=3.